Skip to content

Use safe_denominator to prevent division by zero.#489

Open
odiazib wants to merge 2 commits intomainfrom
oscar/hetfrz-safe-denominator
Open

Use safe_denominator to prevent division by zero.#489
odiazib wants to merge 2 commits intomainfrom
oscar/hetfrz-safe-denominator

Conversation

@odiazib
Copy link
Contributor

@odiazib odiazib commented Feb 2, 2026

We are encountering a division-by-zero error in hetfrz. We rely on safe_denominator to prevent this. Note that PR #481 is also intended to address the same issue. We need to decide which approach to merge into master. @susburrows

@odiazib odiazib requested a review from kaizhangpnl February 2, 2026 16:40
Copy link

@kaizhangpnl kaizhangpnl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this. It looks good to me.

@codecov
Copy link

codecov bot commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.35%. Comparing base (34c4d5c) to head (92f7c2f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #489      +/-   ##
==========================================
- Coverage   93.43%   93.35%   -0.08%     
==========================================
  Files         303      303              
  Lines       25177    24287     -890     
  Branches     2766     2766              
==========================================
- Hits        23523    22673     -850     
+ Misses       1654     1614      -40     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

@susburrows susburrows left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @odiazib and @kaizhangpnl: I reviewed these changes and I think there may be a better approach. Can you please check my suggestion?

Real coat_ratio2 =
haero::max(6.0 * dr_so4_monolayers_dust * vol_core[0], 1.0e-36);
dstcoat[0] = coat_ratio1 / coat_ratio2;
haero::max(6.0 * dr_so4_monolayers_dust * vol_core[0], 0.0);
Copy link

@susburrows susburrows Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am reading the code correctly here, coat_ratio2 returns zero only in the case where vol_core = 0. (The other terms in this equation are constants.)

In turn, vol_core = 0 only when dmac = 0.

dmac is the dust mass air concentration. coat_ratio2 is used here to calculate the surface coating of the dust aerosol present in the air. If there is no dust aerosol (dmac=0), then it makes sense for coat_ratio2 to return NaN, because the entire calculation is physically meaningless in this condition.

Because of this, I suggest the following may be a more appropriate and efficient solution: check whether dmac exceeds a minimum threshold at line 1018; if not, skip the calculations in lines 1018-1029 entirely.

Similar reasoning can be applied to all of the subsequent code blocks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a bug here, since vol_core[0] is calculated as the BC core, not the dust core:

vol_core[0] = bcmpc / (specdens_bc * air_density);

In the original Fortran code, vol_core(2) and vol_core(3) are for dust core.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If vol_core[1] is used (vol_core(2) in Fortran), I think Susannah's fix is good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that mam4xx is doing the same thing as the Fortran code:

https://github.com/eagles-project/e3sm_mam4_refactor/blob/2e58607847ef2868e390c564039e8f04d648d9ac/components/eam/src/physics/cam/hetfrz_classnuc_cam.F90#L1127C2-L1130C40

Note that 1 is 0 in c++.

   vol_core(1)  = aer(bc_pcarbon)/(specdens_bc*rhoair)
   coat_ratio1 = vol_shell(1)*(r_bc*2._r8)*fac_volsfc_bc
   coat_ratio2 = max(6.0_r8*dr_so4_monolayers_dust*vol_core(1), 0.0_r8)
   dstcoat(1) = coat_ratio1/coat_ratio2

Copy link

@kaizhangpnl kaizhangpnl Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a double check: for dst_a1 and dst_a3, the code does use dstcoat[1] and dstcoat[2]:

mam4xx/src/mam4xx/hetfrz.hpp

Lines 1030 to 1047 in 92f7c2f

// dust_a1
coat_ratio1 = vol_shell[1] * (r_dust_a1 * 2.0) * fac_volsfc_dust_a1;
coat_ratio2 = haero::max(6.0 * dr_so4_monolayers_dust * vol_core[1], 0.0);
mult_coat_ratio2 = FloatingPoint<Real>::safe_denominator(coat_ratio2, tol);
dstcoat[1] = coat_ratio1 * mult_coat_ratio2;
// dust_a3
vol_shell[2] = so4mc / (specdens_so4 * air_density) +
pommc / (specdens_pom * air_density) +
soamc / (specdens_soa * air_density) +
mommc / (specdens_mom * air_density);
vol_core[2] = dmc / (specdens_dst * air_density);
coat_ratio1 = vol_shell[2] * (r_dust_a3 * 2.0) * fac_volsfc_dust_a3;
coat_ratio2 = haero::max(6.0 * dr_so4_monolayers_dust * vol_core[2], 0.0);
mult_coat_ratio2 = FloatingPoint<Real>::safe_denominator(coat_ratio2, tol);
dstcoat[2] = coat_ratio1 * mult_coat_ratio2;

so this part is correct.

The code seems to use dstcoat[0] (or dstcoat(1) in fortran) to represent coating on BC. This made me confused, but it seems to be correct now to me (it seems the code also assume dr_so4_monolayers_BC = dr_so4_monolayers_dust).

If we want a if-then condition, I think we should determine if vol_core[0] larger than zero directly. On the other hand, I think the current PR code is good enough.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the current PR should be sufficient to stop the NaN errors (at least in this spot). However, an if-then check would allow several computations to be skipped. If this happens frequently enough (and if similar code bloack happen elsewhere), I assume this might result in some performance gains, which would be helpful.

@odiazib , do you have a sense of how much performance could potentially be gained from these types of changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants